Use web2json agent to clean html for v2 project #88

Open
dreamGirl1996 wants to merge 2 commits into ccprocessor:main from dreamGirl1996:user/luqing

Conversation

@dreamGirl1996

No description provided.

@dreamGirl1996 changed the title from "User/luqing" to "Use web2json agent to clean html for v2 project" on Apr 14, 2026
Collaborator

Are the two files under notebooks/ scripts meant to be run on Spark?

Author

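These two files are not for Spark. They are helper scripts added for debugging and demonstrating the web2json workflow locally in Jupyter/Notebook. They mainly initialize the notebook environment, assemble the configuration, and invoke the whole pipeline directly from the notebook.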

Collaborator

Your input is JSONL, so you need this script to extract the content of the html field as the input to web2json, right?
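For reference, the extraction step being asked about could look roughly like the sketch below. It is not taken from this PR: `iter_html_fields` is a hypothetical helper, and it assumes each JSONL line is a JSON object whose `html` key holds the raw page as a string. How the extracted HTML is then handed to web2json is left out, since the thread does not specify it.

```python
import json
from typing import Iterable, Iterator


def iter_html_fields(jsonl_lines: Iterable[str], field: str = "html") -> Iterator[str]:
    """Yield the string stored under `field` in each JSON Lines record.

    Hypothetical helper for illustration; the field name is an assumption.
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        html = record.get(field)
        # Skip records that lack the field or carry an empty/non-string value.
        if isinstance(html, str) and html:
            yield html
```

A file object can be passed directly, for example `with open(path, encoding="utf-8") as f: pages = list(iter_html_fields(f))`, where `path` points at the JSONL input.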

Collaborator

@1041206149 1041206149 Apr 14, 2026

What problem does the newly added code-simplification logic solve? Please explain it in a comment or in a Feishu doc.

Comment thread web2json/simple.py
Collaborator

What functionality was added here?

Collaborator

Please explain this briefly.


2 participants